Natural Language Generation enhances human decision-making with uncertain information
Decision-making is often dependent on uncertain data, e.g. data associated
with confidence scores or probabilities. We present a comparison of different
information presentations for uncertain data and, for the first time, measure
their effects on human decision-making. We show that the use of Natural
Language Generation (NLG) improves decision-making under uncertainty, compared
to state-of-the-art graphics-based representation methods. In a task-based
study with 442 adults, we found that presentations using NLG lead to 24% better
decision-making on average than the graphical presentations, and to 44% better
decision-making when NLG is combined with graphics. We also show that women
achieve significantly better results when presented with NLG output (an 87%
increase on average compared to graphical presentations).
Comment: 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, 2016.
RankME: Reliable Human Ratings for Natural Language Generation
Human evaluation for natural language generation (NLG) often suffers from
inconsistent user ratings. While previous research tends to attribute this
problem to individual user preferences, we show that the quality of human
judgements can also be improved by experimental design. We present a novel
rank-based magnitude estimation method (RankME), which combines the use of
continuous scales and relative assessments. We show that RankME significantly
improves the reliability and consistency of human ratings compared to
traditional evaluation methods. In addition, we show that it is possible to
evaluate NLG systems according to multiple, distinct criteria, which is
important for error analysis. Finally, we demonstrate that RankME, in
combination with Bayesian estimation of system quality, is a cost-effective
alternative for ranking multiple NLG systems.
Comment: Accepted to NAACL 2018 (The 2018 Conference of the North American Chapter of the Association for Computational Linguistics).
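The abstract does not spell out how the continuous, relative ratings are normalised and turned into a system ranking, so the following is only a minimal illustrative sketch of that kind of aggregation: per-rater log z-scoring of magnitude estimates, then averaging per system. The example data, the choice of normalisation and the reference score of 100 are assumptions, not the paper's protocol.

```python
# Minimal sketch: aggregating rank-based magnitude estimates into a ranking.
# Assumptions (not from the paper): raters give positive magnitudes relative
# to a reference utterance scored 100, and estimates are log-normalised
# per rater before averaging per system.
import math
from collections import defaultdict

# (rater_id, system_id, magnitude_estimate) triples -- illustrative data only
ratings = [
    ("r1", "sysA", 120.0), ("r1", "sysB", 80.0), ("r1", "sysC", 100.0),
    ("r2", "sysA", 150.0), ("r2", "sysB", 60.0), ("r2", "sysC", 110.0),
]

# 1) Log-transform and z-score each rater's estimates to remove
#    individual differences in how raters use the scale.
by_rater = defaultdict(list)
for rater, system, score in ratings:
    by_rater[rater].append((system, math.log(score)))

normalised = defaultdict(list)
for rater, items in by_rater.items():
    values = [v for _, v in items]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    for system, v in items:
        normalised[system].append((v - mean) / std)

# 2) Rank systems by their mean normalised score (best first).
ranking = sorted(normalised, key=lambda s: -sum(normalised[s]) / len(normalised[s]))
print(ranking)
```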
Referenceless Quality Estimation for Natural Language Generation
Traditional automatic evaluation measures for natural language generation
(NLG) use costly human-authored references to estimate the quality of a system
output. In this paper, we propose a referenceless quality estimation (QE)
approach based on recurrent neural networks, which predicts a quality score for
an NLG system output by comparing it to the source meaning representation only.
Our method outperforms traditional metrics and a constant baseline in most
respects; we also show that synthetic data helps to increase correlation
results by 21% compared to the base system. Our results are comparable to
results obtained in similar QE tasks despite the more challenging setting.
Comment: Accepted as a regular paper to the 1st Workshop on Learning to Generate Natural Language (LGNL), Sydney, 10 August 2017.
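As a rough illustration of the approach described (predicting a quality score from the source meaning representation and the system output alone, with no human reference), here is a minimal recurrent sketch. The single-GRU encoders, layer sizes and MSE objective are simplifying assumptions for illustration, not the paper's exact model.

```python
# Minimal sketch of a referenceless QE model: score an NLG output against
# its source meaning representation (MR), with no human reference needed.
# Sizes and the MSE objective are illustrative assumptions.
import torch
import torch.nn as nn

class ReferencelessQE(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.mr_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.scorer = nn.Linear(2 * hid_dim, 1)  # regress a quality score

    def forward(self, mr_ids, out_ids):
        _, mr_h = self.mr_rnn(self.embed(mr_ids))     # final MR encoder state
        _, out_h = self.out_rnn(self.embed(out_ids))  # final output encoder state
        joint = torch.cat([mr_h[-1], out_h[-1]], dim=-1)
        return self.scorer(joint).squeeze(-1)

# Toy usage: a batch of two (MR, output) pairs trained against human ratings.
model = ReferencelessQE(vocab_size=1000)
mr = torch.randint(1, 1000, (2, 10))
out = torch.randint(1, 1000, (2, 15))
target = torch.tensor([4.5, 2.0])  # e.g. quality ratings on a 1-6 scale
loss = nn.functional.mse_loss(model(mr, out), target)
loss.backward()
```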
Findings of the E2E NLG Challenge
This paper summarises the experimental setup and results of the first shared
task on end-to-end (E2E) natural language generation (NLG) in spoken dialogue
systems. Recent end-to-end generation systems are promising since they reduce
the need for data annotation. However, they are currently limited to small,
delexicalised datasets. The E2E NLG shared task aims to assess whether these
novel approaches can generate better-quality output by learning from a dataset
containing higher lexical richness, syntactic complexity and diverse discourse
phenomena. We compare 62 systems submitted by 17 institutions, covering a wide
range of approaches, including machine learning architectures -- with the
majority implementing sequence-to-sequence models (seq2seq) -- as well as
systems based on grammatical rules and templates.
Comment: Accepted to INLG 2018.
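Since the majority of submissions were sequence-to-sequence models mapping a meaning representation to text, a bare-bones encoder-decoder for this setting is sketched below, purely for orientation. It does not correspond to any submitted system; the sizes, the attention-free decoder and the random toy data are assumptions.

```python
# Minimal sketch of the seq2seq recipe used by most E2E submissions:
# encode a linearised MR, decode an utterance token by token.
# All sizes and the toy data are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2SeqNLG(nn.Module):
    def __init__(self, mr_vocab, word_vocab, emb=64, hid=128):
        super().__init__()
        self.enc_emb = nn.Embedding(mr_vocab, emb, padding_idx=0)
        self.dec_emb = nn.Embedding(word_vocab, emb, padding_idx=0)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, word_vocab)

    def forward(self, mr_ids, prev_word_ids):
        _, h = self.encoder(self.enc_emb(mr_ids))            # MR summary state
        dec_out, _ = self.decoder(self.dec_emb(prev_word_ids), h)
        return self.out(dec_out)                             # next-word logits

# Toy training step with teacher forcing on random token ids.
model = Seq2SeqNLG(mr_vocab=200, word_vocab=500)
mr = torch.randint(1, 200, (4, 8))
words = torch.randint(1, 500, (4, 12))
logits = model(mr, words[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 500), words[:, 1:].reshape(-1))
loss.backward()
```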
Data-driven Natural Language Generation: Paving the Road to Success
We argue that there are currently two major bottlenecks to the commercial use
of statistical machine learning approaches for natural language generation
(NLG): (a) The lack of reliable automatic evaluation metrics for NLG, and (b)
The scarcity of high quality in-domain corpora. We address the first problem by
thoroughly analysing current evaluation metrics and motivating the need for a
new, more reliable metric. The second problem is addressed by presenting a
novel framework for developing and evaluating a high quality corpus for NLG
training.
Comment: WiNLP workshop at ACL 2017.
Crowd-sourcing NLG Data: Pictures Elicit Better Data
Recent advances in corpus-based Natural Language Generation (NLG) hold the
promise of being easily portable across domains, but require costly training
data, consisting of meaning representations (MRs) paired with Natural Language
(NL) utterances. In this work, we propose a novel framework for crowdsourcing
high quality NLG training data, using automatic quality control measures and
evaluating different MRs with which to elicit data. We show that pictorial MRs
result in better NL data being collected than logic-based MRs: utterances
elicited by pictorial MRs are judged as significantly more natural, more
informative, and better phrased, with a significant increase in average quality
ratings (around 0.5 points on a 6-point scale), compared to using the logical
MRs. As the MR becomes more complex, the benefits of pictorial stimuli
increase. The collected data will be released as part of this submission.
Comment: The 9th International Natural Language Generation Conference (INLG), 2016. 10 pages, 2 figures, 3 tables.
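The abstract mentions automatic quality-control measures for the crowdsourced utterances without detailing them; the check below (verifying that every MR slot value is actually realised in a collected utterance) is one hypothetical example of such a filter. The slot names and example sentences are invented for illustration, not taken from the collected corpus.

```python
# Hypothetical quality-control filter for crowdsourced NLG data:
# flag utterances that fail to mention one of the MR slot values.
# The MR slots and examples below are invented for illustration.
def covers_all_slots(mr: dict, utterance: str) -> bool:
    text = utterance.lower()
    return all(str(value).lower() in text for value in mr.values())

mr = {"name": "The Eagle", "food": "French", "area": "riverside"}
good = "The Eagle serves French food in the riverside area."
bad = "The Eagle is a nice place to eat."

print(covers_all_slots(mr, good))  # True  -> keep
print(covers_all_slots(mr, bad))   # False -> flag for review
```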
The E2E Dataset: New Challenges For End-to-End Generation
This paper describes the E2E data, a new dataset for training end-to-end,
data-driven natural language generation systems in the restaurant domain, which
is ten times bigger than existing, frequently used datasets in this area. The
E2E dataset poses new challenges: (1) its human reference texts show more
lexical richness and syntactic variation, including discourse phenomena; (2)
generating from this set requires content selection. As such, learning from
this dataset promises more natural, varied and less template-like system
utterances. We also establish a baseline on this dataset, which illustrates
some of the difficulties associated with this data.
Comment: Accepted as a short paper for SIGDIAL 2017 (final submission including supplementary material).
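To make the input side of the task concrete: E2E meaning representations are flat lists of attribute[value] pairs, each paired with one or more free-text references. The restaurant values and the tiny parser below are an illustrative sketch, not an excerpt from the dataset.

```python
# Illustrative sketch: an E2E-style meaning representation (MR) is a flat
# list of attribute[value] pairs paired with a free-text reference.
# The specific values below are invented for illustration.
import re

mr = "name[The Cricketers], eatType[pub], food[English], area[riverside]"
reference = "The Cricketers is an English pub by the riverside."

def parse_mr(mr_string: str) -> dict:
    """Turn 'attr[value], attr[value], ...' into a dict of slots."""
    return dict(re.findall(r"(\w+)\[([^\]]*)\]", mr_string))

print(parse_mr(mr))
# {'name': 'The Cricketers', 'eatType': 'pub', 'food': 'English', 'area': 'riverside'}
```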
A Review of Evaluation Techniques for Social Dialogue Systems
In contrast with goal-oriented dialogue, social dialogue has no clear measure
of task success. Consequently, evaluation of these systems is notoriously hard.
In this paper, we review current evaluation methods, focusing on automatic
metrics. We conclude that turn-based metrics often ignore the context and do
not account for the fact that several replies are valid, while end-of-dialogue
rewards are mainly hand-crafted. Both lack grounding in human perceptions.
Comment: 2 pages.
Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity
We present three enhancements to existing encoder-decoder models for
open-domain conversational agents, aimed at effectively modeling coherence and
promoting output diversity: (1) We introduce a measure of coherence as the
GloVe embedding similarity between the dialogue context and the generated
response, (2) we filter our training corpora based on the measure of coherence
to obtain topically coherent and lexically diverse context-response pairs, (3)
we then train a response generator using a conditional variational autoencoder
model that incorporates the measure of coherence as a latent variable and uses
a context gate to guarantee topical consistency with the context and promote
lexical diversity. Experiments on the OpenSubtitles corpus show a substantial
improvement over competitive neural models in terms of BLEU score as well as
metrics of coherence and diversity.
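The coherence measure in point (1) is concrete enough to sketch: cosine similarity between averaged GloVe word embeddings of the dialogue context and of the response. The snippet assumes pre-loaded GloVe vectors in a plain word-to-vector dict; loading, tokenisation and the filtering threshold are simplified assumptions rather than the paper's exact setup.

```python
# Sketch of the coherence measure from point (1): cosine similarity between
# averaged GloVe embeddings of the dialogue context and the response.
# `glove` is assumed to be a dict mapping words to numpy vectors
# (e.g. loaded from a glove.6B file); tokenisation is simplified.
import numpy as np

def sentence_embedding(text: str, glove: dict, dim: int = 300) -> np.ndarray:
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def coherence(context: str, response: str, glove: dict) -> float:
    c = sentence_embedding(context, glove)
    r = sentence_embedding(response, glove)
    denom = np.linalg.norm(c) * np.linalg.norm(r)
    return float(np.dot(c, r) / denom) if denom > 0 else 0.0

# Usage for step (2): context-response pairs scoring below some threshold
# would be dropped from the training corpus; the 0.6 here is illustrative.
# keep = coherence(context, response, glove) > 0.6
```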